1 Data structure

  • Individuals from 2591 SSC families were genotyped on three chips. Note that members of each family were analyzed on the same array.
    • Illumina 1Mv3 Duo (1189 families), 4626 people (2703 males, 1923 females), 1199033 SNPs
    • Illumina 1Mv1 (333 families), 1354 people (801 males, 553 females), 1072814 SNPs
    • Illumina HumanOmni2.5M (1069 families), 4240 people (2490 males, 1750 females), 2440283 SNPs
    • Alleles are coded as A/T/G/C, so no recoding is needed
    • Chromosomes are coded chr1 to chr26; autosomes 1-22 and X (codes 23 and 25) will be imputed (**--output-chr M recodes 23 as X**)
    • SNP IDs: pay attention to Omni2.5. First convert IDs of the form SNP(chr)-pos to kgp identifiers with --update-name, then convert the kgp identifiers to rsIDs.
    • All individuals are included for imputation
  • Imputation will be done separately for each chip.
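
As a quick consistency check on the counts listed above, the per-chip numbers can be summed in a few lines. This is only a sketch; the dictionary layout and variable names are illustrative and not part of the original pipeline.

```python
# Per-chip counts copied from the list above: (families, people, males, females).
chips = {
    "1Mv3": (1189, 4626, 2703, 1923),
    "1Mv1": (333, 1354, 801, 553),
    "Omni2.5": (1069, 4240, 2490, 1750),
}

total_families = sum(fam for fam, _, _, _ in chips.values())
total_people = sum(people for _, people, _, _ in chips.values())

# Males plus females should equal the number of people on each chip.
sex_consistent = all(m + f == people for _, people, m, f in chips.values())

print(total_families, total_people, sex_consistent)  # 2591 10220 True
```

The family total matches the 2591 families stated at the top of this section.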


2 Pre-imputation


2.1 Strand flip & liftover

  • From build 36 to build 37, and from minus strand to plus strand
  • SSC was assembled based on build 36 (confirmed with dbSNP)
  • strand files
    • Human1Mv1_C-b37.strand is used for both 1Mv1 and 1Mv3
    • HumanOmni2.5-8v1_A-b37.Source.strand is used for Omni2.5
  • Check the output files from update_build.sh
    • remove chr 24
    • recode chr 23 to chr X
    • run --update-name for kgp identifiers in Omni2.5 (when many SNPs are still coded as kgp)
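
The cleanup steps above could be expressed as a single PLINK 1.9 call. The sketch below only builds the command line and does not run it. The function name and all file prefixes are hypothetical, and it is an assumption that the three steps were done in one call rather than separately.

```python
def cleanup_command(in_prefix, out_prefix, update_name_file=None):
    """Build a PLINK 1.9 command for the post-liftover cleanup (not executed here)."""
    cmd = [
        "plink", "--bfile", in_prefix,
        "--not-chr", "24",      # drop chromosome 24 (PLINK's numeric code for Y)
        "--output-chr", "M",    # write X/Y/XY/M letter codes instead of 23-26
        "--make-bed", "--out", out_prefix,
    ]
    if update_name_file is not None:
        # Two-column text file: old variant ID, new variant ID (hypothetical file).
        cmd += ["--update-name", update_name_file]
    return cmd


# Hypothetical prefixes, for illustration only.
print(" ".join(cleanup_command("omni25_b37", "omni25_b37_clean", "kgp_ids.txt")))
```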


2.2 Genotype QC

  • Filters: --geno 0.05 --hwe 1e-6 --mind 0.1 --maf 0.01.
  • Then recode to .vcf files (separate autosome and X-chromosome for now)
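
A minimal sketch of how the filters and the VCF export might be written as PLINK 1.9 commands. The function name, file prefixes and output suffixes are hypothetical, and exporting only code 23 for the X chromosome is an assumption. The commands are built but not run.

```python
def qc_commands(in_prefix, out_prefix):
    """Return PLINK 1.9 commands for genotype QC and per-region VCF export."""
    filters = ["--geno", "0.05", "--hwe", "1e-6", "--mind", "0.1", "--maf", "0.01"]
    qc = ["plink", "--bfile", in_prefix, *filters, "--make-bed", "--out", out_prefix]
    # Autosomes and the X chromosome are written to separate VCF files.
    autosomes = ["plink", "--bfile", out_prefix, "--chr", "1-22",
                 "--recode", "vcf", "--out", out_prefix + "_autosomes"]
    chr_x = ["plink", "--bfile", out_prefix, "--chr", "23",
             "--recode", "vcf", "--out", out_prefix + "_chrX"]
    return [qc, autosomes, chr_x]


for command in qc_commands("ssc_1Mv3_b37", "ssc_1Mv3_qc"):  # hypothetical prefixes
    print(" ".join(command))
```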

2.2.1 Illumina 1Mv3


  • 0 people removed due to missing genotype data (--mind).
  • Total genotyping rate is 0.990957.
  • 15033 variants removed due to missing genotype data (--geno).
  • 126158 variants removed due to minor allele threshold(s).
  • --hwe: 38392 variants removed due to Hardy-Weinberg exact test.
  • 859161 variants and 4626 people pass filters and QC.


2.2.2 Illumina 1Mv1


  • 0 people removed due to missing genotype data (--mind).
  • Total genotyping rate is 0.976306.
  • 29018 variants removed due to missing genotype data (--geno).
  • 136291 variants removed due to minor allele threshold(s).
  • --hwe: 6476 variants removed due to Hardy-Weinberg exact test.
  • 898063 variants and 1354 people pass filters and QC.


2.2.3 Illumina Omni2.5M


  • 0 people removed due to missing genotype data (--mind).
  • Total genotyping rate is 0.99696.
  • 7885 variants removed due to missing genotype data (--geno).
  • 723923 variants removed due to minor allele threshold(s).
  • --hwe: 53449 variants removed due to Hardy-Weinberg exact test.
  • 1575751 variants and 4240 people pass filters and QC.
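
The log counts in 2.2.1 to 2.2.3 can be added up to get the number of variants that entered QC on each chip. This sketch assumes the three filters removed disjoint sets of variants, as PLINK reports them sequentially. The implied inputs are derived values, not numbers taken from the logs.

```python
# Counts copied from the PLINK log lines above.
qc_log = {
    "1Mv3": {"geno": 15033, "maf": 126158, "hwe": 38392, "pass": 859161},
    "1Mv1": {"geno": 29018, "maf": 136291, "hwe": 6476, "pass": 898063},
    "Omni2.5": {"geno": 7885, "maf": 723923, "hwe": 53449, "pass": 1575751},
}

# Implied input = variants removed by each filter + variants passing QC.
implied_input = {chip: sum(counts.values()) for chip, counts in qc_log.items()}

for chip, n in implied_input.items():
    print(chip, n)
# 1Mv3 1038744
# 1Mv1 1069848
# Omni2.5 2361008
```

In each case the implied input is below the raw SNP count of the chip, as expected if some SNPs were dropped during the strand flip and liftover.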


2.3 Check and fix the REF allele

  • only for sanger imputation server
  • using bcftools
  • check with HRC build37
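
One way to do this step is with the bcftools fixref plugin. The notes do not say which bcftools command was actually used, so that choice is an assumption. The file names are hypothetical, and the commands are built but not run.

```python
def fixref_commands(vcf_in, vcf_out, fasta):
    """Build bcftools +fixref commands: a stats-only check, then a fix."""
    # Default mode only reports how many sites mismatch the reference.
    check = ["bcftools", "+fixref", vcf_in, "--", "-f", fasta]
    # Flip mode corrects strand/REF issues; -d discards sites that cannot be fixed.
    fix = ["bcftools", "+fixref", vcf_in, "-Oz", "-o", vcf_out,
           "--", "-f", fasta, "-m", "flip", "-d"]
    return check, fix


# Hypothetical file names, for illustration only.
check, fix = fixref_commands("ssc_qc.vcf.gz", "ssc_qc_fixref.vcf.gz", "b37_reference.fa")
print(" ".join(check))
print(" ".join(fix))
```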

The whole pre-imputation QC process is stored in /gpfs1/scratch/group30days/cnsg_park/uqywan/auti/scripts/ssc_preImp_QC.R



2.4 Submit for imputation



3 After Imputation

  • recode .vcf to plink format
  • extract info scores in .info
  • fill missing SNP IDs in .bim/.info
  • update .fam file in case there are two "_" in the FID_IID
  • QC for each chip (filters: --info 0.03)
    • 1Mv1: 29,613,517 SNPs left
    • 1Mv3: 31,605,064 SNPs left
    • Omni2.5: 30,466,266 SNPs left
    • 23,460,315 SNPs overlapped
  • merge the PLINK bfiles of the three chips
    • remaining 22,274,458 SNPs after QC for --geno 0.05 --mac 1 --hwe 1e-6
    • remaining 7,480,061 SNPs after QC for --geno 0.05 --maf 0.01 --hwe 1e-6
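
The per-chip info filter and the overlap count can be sketched as below. The data are toy values, not SSC results. The (SNP, info) row layout, the function names, and the use of "at least the threshold" as the pass rule are all assumptions.

```python
def passing_snps(info_rows, threshold):
    """Return the IDs of SNPs whose info score is at least `threshold`."""
    return {snp for snp, info in info_rows if info >= threshold}


def shared_snps(per_chip):
    """Return SNP IDs that pass the filter on every chip."""
    sets = list(per_chip.values())
    return set.intersection(*sets) if sets else set()


# Toy (SNP, info) rows for three chips.
toy_info = {
    "1Mv1": [("rs1", 0.95), ("rs2", 0.01), ("rs3", 0.40)],
    "1Mv3": [("rs1", 0.90), ("rs2", 0.50), ("rs3", 0.35)],
    "Omni2.5": [("rs1", 0.99), ("rs3", 0.02)],
}
kept = {chip: passing_snps(rows, threshold=0.03) for chip, rows in toy_info.items()}
print(sorted(shared_snps(kept)))  # ['rs1']
```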


3.1 Frequency distribution

  • overlapped SNPs between SSC_imputed and HRC (~7.2M SNPs with MAF > 0.01)
  • based on same allele
  • 0 SNPs with MAF difference > 0.2
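
The comparison above can be sketched as a small function that matches SNPs by ID, keeps only those where both data sets report the frequency of the same allele, and flags differences above 0.2. The input layout and function name are hypothetical, and the data are toy values.

```python
def freq_outliers(study, reference, max_diff=0.2):
    """Both inputs map SNP ID -> (allele, frequency of that allele)."""
    outliers = []
    for snp, (allele, freq) in study.items():
        if snp not in reference:
            continue  # not an overlapping SNP
        ref_allele, ref_freq = reference[snp]
        if allele != ref_allele:
            continue  # only compare frequencies of the same allele
        if abs(freq - ref_freq) > max_diff:
            outliers.append(snp)
    return outliers


# Toy frequencies, not SSC or HRC values.
study = {"rs1": ("A", 0.30), "rs2": ("C", 0.10), "rs3": ("G", 0.20)}
reference = {"rs1": ("A", 0.35), "rs2": ("C", 0.45), "rs3": ("T", 0.80)}
print(freq_outliers(study, reference))  # ['rs2']
```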



3.2 PCA

  • Project the first 3 PCs based on pruned HapMap3 SNPs onto 1000G
  • Use K-means to calculate distances
  • Assign ancestry using a posterior probability threshold of 0.9
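
The sketch below illustrates the general idea of assigning ancestry from projected PCs. It is not the K-means procedure used here. It is a nearest-centroid stand-in in which centroids come from labelled reference samples and the "posterior" is a normalised Gaussian kernel weight. The kernel, the `scale` parameter, and the "at least 0.9" rule are all assumptions.

```python
import math


def centroids(ref_pcs, ref_labels):
    """Mean PC vector per reference population label."""
    sums, counts = {}, {}
    for pcs, label in zip(ref_pcs, ref_labels):
        acc = sums.setdefault(label, [0.0] * len(pcs))
        for i, value in enumerate(pcs):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def assign(sample_pcs, cents, scale=1.0, min_prob=0.9):
    """Return (label or None, probability) for one sample."""
    d2 = {lab: sum((a - b) ** 2 for a, b in zip(sample_pcs, c))
          for lab, c in cents.items()}
    best = min(d2, key=d2.get)
    # Weights proportional to exp(-d^2 / (2 * scale^2)), shifted for stability.
    weights = {lab: math.exp(-(v - d2[best]) / (2 * scale ** 2))
               for lab, v in d2.items()}
    prob = weights[best] / sum(weights.values())
    return (best if prob >= min_prob else None), prob


# Toy reference samples with 3 PCs each.
cents = centroids([[0, 0, 0], [0.2, 0, 0], [10, 10, 10], [10.2, 10, 10]],
                  ["EUR", "EUR", "AFR", "AFR"])
print(assign([0.1, 0.1, 0.0], cents)[0])  # EUR
print(assign([5.1, 5.0, 5.0], cents)[0])  # None (about halfway between centroids)
```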

3.2.1 All individuals



3.2.2 Probands only

3.2.2.1 Self-reported vs. assigned ancestry

Self-reported         AFR   AMR   EAS   EUR   SAS
african-amer           91     7     0     0     1
asian                   0     1    60     0    47
more-than-one-race     11   100     2    44    45
native-american         0     4     0     1     0
native-hawaiian         0     0     1     1     0
not-specified           1    16     0     2     0
other                   4    91     0    14    12
white                   1   116     0  1907     8


3.2.2.2 PCA plot based on assigned ancestry



3.3 Relatedness in Probands

  • Generate GRM based on LD pruned SNPs
  • 2206 probands with pairwise GRM < 0.05
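
The relatedness filter can be illustrated with a simple greedy rule: repeatedly drop the individual who has the most remaining relatives at or above the cutoff. This is a sketch under stated assumptions and not necessarily the algorithm of the tool used here. The input layout and function name are hypothetical, and the data are toy values.

```python
def unrelated_subset(ids, grm_pairs, cutoff=0.05):
    """grm_pairs maps (id_a, id_b) -> off-diagonal GRM value."""
    related = {i: set() for i in ids}
    for (a, b), value in grm_pairs.items():
        if a != b and value >= cutoff:
            related[a].add(b)
            related[b].add(a)
    kept = set(ids)
    while kept:
        # Individual with the most related partners among those still kept.
        worst = max(kept, key=lambda i: len(related[i] & kept))
        if not related[worst] & kept:
            break  # no related pairs remain
        kept.remove(worst)
    return [i for i in ids if i in kept]


# Toy GRM values for four probands.
pairs = {("p1", "p2"): 0.25, ("p2", "p3"): 0.10, ("p3", "p4"): 0.01}
print(unrelated_subset(["p1", "p2", "p3", "p4"], pairs))  # ['p1', 'p3', 'p4']
```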


**The whole process is stored in /gpfs1/scratch/group30days/cnsg_park/uqywan/auti/scripts/afterImp_QC*.R**